
Conversation

jimmygchen
Member

@jimmygchen jimmygchen commented Aug 22, 2025

Issue Addressed

Part of #7866

In the above PR, we enabled rayon for batch KZG verification in chain segment processing. However, using the global rayon thread pool for backfill is likely to create resource contention with higher-priority beacon processor work.

Proposed Changes

This PR introduces a dedicated low-priority rayon thread pool LOW_PRIORITY_RAYON_POOL and uses it for processing backfill chain segments.

This prevents backfill KZG verification from using the global rayon thread pool and competing with high-priority beacon processor tasks for CPU resources.
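
As a rough sketch of the idea (the pool sizing and the install call below are assumptions for illustration, not necessarily the PR's exact code):

use std::sync::{Arc, LazyLock};

use rayon::{ThreadPool, ThreadPoolBuilder};

// Dedicated low-priority pool: roughly a quarter of the CPUs with a small
// floor, so backfill work cannot saturate every core.
pub static LOW_PRIORITY_RAYON_POOL: LazyLock<Arc<ThreadPool>> = LazyLock::new(|| {
    let threads = (num_cpus::get() / 4).max(2);
    Arc::new(
        ThreadPoolBuilder::new()
            .num_threads(threads)
            .thread_name(|i| format!("low_priority_rayon_{i}"))
            .build()
            .expect("failed to build low-priority rayon pool"),
    )
});

// Backfill KZG verification then runs inside this pool instead of the global
// one, e.g. (hypothetical call site):
// LOW_PRIORITY_RAYON_POOL.install(|| verify_kzg_for_chain_segment(&blocks));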

However, this PR by itself doesn't prevent CPU oversubscription, because other tasks could still fill up the global rayon thread pool, and having an extra thread pool could make things worse. To address this, we need the beacon processor to coordinate total CPU allocation across all tasks, which is covered in:

@jimmygchen jimmygchen added blocked work-in-progress PR is a work-in-progress do-not-merge optimization Something to make Lighthouse run more efficiently. labels Aug 22, 2025
@jimmygchen jimmygchen changed the base branch from stable to unstable August 22, 2025 15:43
@jimmygchen jimmygchen force-pushed the backfill-verify-kzg-use-scoped-rayon branch from 0e9d888 to 47a80e5 Compare August 22, 2025 15:43
@eserilev
Member

eserilev commented Aug 29, 2025

In the non-scoped rayon pool case with zero-delay reconstruction, I did see the node briefly become unsynced. Additionally, I saw some instances of these log messages:

WARN  Delayed proposer preparation                  prepare_slot: 263808, validator: 72
WARN  Error signalling fork choice waiter           error: ForkChoiceSignalOutOfOrder { current: Slot(263831), latest: Slot(263830) }, slot: 263830

I also saw spikes in gossip attestation times (up to 10s)
[screenshot: tracing metrics]

The numbers dropped to the 20ms range as soon as I switched to scoped rayon pool usage (at the 6:00-ish mark in the screenshot above). I also saw no sync issues or WARN log messages. This is also when column reconstruction is set to zero delay (i.e. reconstructing as soon as we get half the columns)

With scoped rayon pool usage, reconstruction during backfill slows down a bit (as expected).
[screenshot: tracing metrics]

@eserilev eserilev marked this pull request as ready for review September 1, 2025 01:59
@eserilev eserilev requested a review from jxs as a code owner September 1, 2025 01:59
@eserilev eserilev added ready-for-review The code is ready for review v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky das Data Availability Sampling and removed blocked work-in-progress PR is a work-in-progress do-not-merge labels Sep 1, 2025
@eserilev
Member

eserilev commented Sep 1, 2025

I've done a bunch of testing on backfill with the global rayon pool vs scoped rayon pool usage. The biggest difference is that KZG verification takes more than double the time with a scoped rayon pool (using a max of ~25% of CPU threads) vs the global rayon pool. Additionally, average CPU usage was lower in the scoped pool case. My node did not have any issues following the chain in either case during backfill, so I can't argue that this change is absolutely necessary. But it is a safe and relatively simple optimization that can help protect nodes from having issues during backfill sync, so maybe that's a good enough reason to include it in our 8.0.0 release candidate.

I haven't seen any evidence that something like #7789 is required. It seems like the OS scheduler is good enough at figuring things out with the current scoped rayon pool usage. If we were to expand our scoped rayon pool usage to other work events, #7789 could become more relevant.

In a future iteration (or in this PR) we could make the low-priority rayon pool configurable via a CLI flag, to give users additional control over the speed of backfill. This is probably unnecessary at the moment, but it could become useful if we expand scoped rayon thread pool usage.

Another TODO could be to introduce a high priority rayon pool and always avoid scheduling rayon tasks on the global thread pool. It remains to be seen if that would be useful considering our current rayon usage.


mergify bot commented Sep 1, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 1, 2025
@eserilev
Member

eserilev commented Sep 2, 2025

I've added a high-priority rayon pool that's used during backfill when the flag --disable-backfill-rate-limiting is set. This effectively speeds up KZG verification ~2x vs using the low-priority pool. The flag itself already warns that the node may have degraded attestation performance, so I don't think we need to warn users more than we already do.
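
For illustration, a minimal sketch of the pool selection this implies (the struct below is a stand-in for the real config, and the wiring is assumed):

// Stand-in for the real BeaconProcessorConfig; only the relevant flag is shown.
struct BeaconProcessorConfig {
    enable_backfill_rate_limiting: bool,
}

// Rate limiting enabled (default) -> low-priority pool; with
// --disable-backfill-rate-limiting -> high-priority pool (~2x faster backfill KZG verification).
fn backfill_pool<'a>(
    config: &BeaconProcessorConfig,
    high: &'a rayon::ThreadPool,
    low: &'a rayon::ThreadPool,
) -> &'a rayon::ThreadPool {
    if config.enable_backfill_rate_limiting {
        low
    } else {
        high
    }
}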

@eserilev
Member

eserilev commented Sep 2, 2025

Something about the changes to the backfill work event is causing a test to fail non-deterministically. I'm wondering if it's because the backfill task is now blocking instead of async?

@eserilev
Member

eserilev commented Sep 2, 2025

I made a small tweak to assert_event_journal_with_timeout to ignore WORKER_FREED and NOTHING_TO_DO events for the test case that was failing non-deterministically.
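
A minimal sketch of the kind of filtering this tweak implies (event names and signature here are assumptions, not the actual test helper):

// Drop bookkeeping events from the observed journal before comparing it
// against the expected sequence, so scheduling noise can't fail the test.
fn filter_journal(
    events: Vec<String>,
    ignore_worker_freed: bool,
    ignore_nothing_to_do: bool,
) -> Vec<String> {
    events
        .into_iter()
        .filter(|e| !(ignore_worker_freed && e.as_str() == "worker_freed"))
        .filter(|e| !(ignore_nothing_to_do && e.as_str() == "nothing_to_do"))
        .collect()
}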

@eserilev eserilev added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Sep 2, 2025

mergify bot commented Sep 2, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 2, 2025
@mergify mergify bot added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Sep 3, 2025
@jimmygchen
Member Author

Hi @ethDreamer, would you mind reviewing this please? 🙏

Member Author

@jimmygchen jimmygchen left a comment

@eserilev looks great, I've added some comments. Let me know what you think!

pub low_priority_threadpool: Arc<ThreadPool>,
/// Larger rayon thread pool for high-priority, compute-intensive tasks.
/// By default 100% of CPUs.
pub high_priority_threadpool: Arc<ThreadPool>,
Member Author

If we're allocating work to this thread pool, we might need to be a bit careful with using the global thread pool, because now we can allocate work to:

  • the global thread pool with num_cpus threads
  • high_priority_threadpool with num_cpus threads
  • the low-priority thread pool with a minimum of 2 threads

so we now let rayon use up to 2 * num_cpus + 2 threads, which may cause even more oversubscription.
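
A back-of-the-envelope sketch of that worst case (pool sizes assumed from this thread):

// Worst-case number of rayon worker threads if work can land on all three pools.
fn worst_case_rayon_threads(num_cpus: usize) -> usize {
    let global = num_cpus;                    // implicit global pool
    let high_priority = num_cpus;             // proposed high-priority pool
    let low_priority = (num_cpus / 4).max(2); // low-priority pool, floor of 2
    // e.g. 8 CPUs -> 8 + 8 + 2 = 18 rayon threads competing for 8 cores
    global + high_priority + low_priority
}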

Member Author

I see a few into_par_iter usages, but I think the only compute-intensive ones are reconstruction and verify_cell_proof_batch.

So maybe we should spawn those tasks with this thread pool too?
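
The wrapping itself is straightforward; a generic sketch with a toy workload (the real call sites would be reconstruction and verify_cell_proof_batch):

use rayon::prelude::*;
use rayon::ThreadPool;

// Any existing `into_par_iter` workload can be redirected by wrapping it in
// `pool.install`; parallel work spawned inside the closure runs on that pool.
fn square_all_in_pool(pool: &ThreadPool, values: Vec<u64>) -> Vec<u64> {
    pool.install(|| values.into_par_iter().map(|v| v * v).collect())
}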

Member

@eserilev eserilev Sep 15, 2025

I don't think rayon ends up using 2 * num_cpus threads if we schedule one work event on the global pool and one work event on the high-priority thread pool.

This is an article from one of the rayon maintainers:
https://smallcultfollowing.com/babysteps/blog/2015/12/18/rayon-data-parallelism-in-rust/

It's a bit old, but he explains rayon's work-stealing technique; the tl;dr is that idle rayon threads constantly try to steal pending tasks so work executes in parallel.

I also checked with our AI overlords, and they mentioned that work scheduled on the global pool might have a bit more difficulty stealing work from the scoped pool (and vice versa).

As far as I can tell, we could probably do away with the high-priority thread pool (unless we want it sized to num_cpus < max_cpus), as it doesn't really provide any benefit over the global thread pool.

Member Author

I haven't dug into it enough, but my understanding was:

  • work-stealing only exists between threads in the same thread pool
  • an "idle thread" means the rayon thread has no task in its local work queue (not that the physical CPU core is idle)

Member

@eserilev eserilev Sep 15, 2025

Yeah, I'm just thinking that the global thread pool provides the same functionality as the high-priority thread pool with a bit less effort code-wise. If we wanted to run something like reconstruction in the scoped high-priority thread pool, I think we'd need to make the reconstruction work event a blocking task (like what you did with ChainSegmentBackFill).
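
For reference, a generic sketch of what a blocking hand-off could look like (this is not Lighthouse's actual worker plumbing; the tokio call and workload are assumptions):

use std::sync::Arc;

use rayon::prelude::*;

// Hand the CPU-heavy rayon workload to a blocking thread so the async executor
// isn't stalled, then `install` into the scoped pool from there.
async fn reconstruct_in_pool(pool: Arc<rayon::ThreadPool>, columns: Vec<Vec<u8>>) -> usize {
    tokio::task::spawn_blocking(move || {
        // Toy workload standing in for column reconstruction.
        pool.install(|| columns.par_iter().map(|c| c.len()).sum::<usize>())
    })
    .await
    .expect("blocking task panicked")
}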

Member Author

Yeah agree!

use std::sync::Arc;

pub const DEFAULT_LOW_PRIORITY_DIVISOR: usize = 4;
const MINIMUM_LOW_PRIORITY_THREAD_COUNT: usize = 2;
Member Author

Maybe we can even lower this to 1 (suggested by @ethDreamer).
With EIP-7870, the minimum requirement for a full node is 4 physical cores / 8 threads, so in production environments we should get 2 at the minimum.

However, if the process runs in a containerised environment with CPU limits (e.g. for devnet testing), then we might end up allocating more than we should: if max_cpu=4, we'd allocate half of the available threads.

I think it's probably safer to make this as low as possible (1). That should be fine because, as the name suggests, this work is low priority, so running it on a single thread is acceptable.
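
The sizing concern in numbers (formula assumed from the constants above):

// Low-priority pool size as a function of available CPUs, divisor, and floor.
fn low_priority_threads(available_cpus: usize, divisor: usize, minimum: usize) -> usize {
    (available_cpus / divisor).max(minimum)
}

// low_priority_threads(4, 4, 2) == 2  -> half of a CPU-limited 4-thread container
// low_priority_threads(4, 4, 1) == 1  -> the proposed lower floor
// low_priority_threads(8, 4, 1) == 2  -> unchanged on an EIP-7870 minimum-spec node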

@@ -623,11 +625,21 @@ impl TestRig {
&mut self,
expected: &[&str],
timeout: Duration,
ignore_worker_freed: bool,
ignore_nothing_to_do: bool,
Member Author

I made a small tweak to assert_event_journal_with_timeout to ignore WORKER_FREED and NOTHING_TO_DO events for the test case that was failing non-deterministically.

Do you think the non-determinism was due to backfill rate limiting? Would disabling it fix the issue and speed up the tests?

let beacon_processor_config = BeaconProcessorConfig {
    enable_backfill_rate_limiting: false,
    ..Default::default()
};

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 11, 2025